Persian Plagiarism Detection Using Sentence Correlations
Authors
Abstract
This report describes the Persian plagiarism detection system with which we submitted our run to the Persian PlagDet competition at FIRE 2016. The system works in four main stages. The first is pre-processing and tokenization. The second is constructing a corpus of sentences from the combination of a source and suspicious document pair; each sentence is treated as a document and represented as a tf-idf vector. The third is constructing a similarity matrix between the source and suspicious documents. Finally, the most similar documents whose similarity is higher than a specific threshold are marked as plagiarized segments. Our performance measures on the training corpus were promising (precision=0.914, recall=0.848, granularity=3.85).
CCS Concepts • Information systems → Information retrieval → Retrieval tasks and goals. Near-duplicate and plagiarism detection.
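The four stages in the abstract can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The sentence splitter, whitespace tokenizer, smoothed idf formula, use of cosine similarity, the default threshold of 0.5, and all function names are assumptions; the abstract does not specify any of them.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Stage 1 (simplified): split on Latin and Persian sentence-ending marks.

    Assumption: real pre-processing would also normalize Persian characters.
    """
    parts = re.split(r"[.!?؟\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def tfidf_vectors(sentences):
    """Stage 2: treat each sentence as a document; return sparse tf-idf dicts."""
    tokenized = [s.split() for s in sentences]  # whitespace tokenization (assumed)
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (assumed variant): log((1 + n) / (1 + df)) + 1
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def detect(source_text, suspicious_text, threshold=0.5):
    """Stages 3 and 4: build the similarity matrix, then keep pairs above threshold."""
    src = split_sentences(source_text)
    susp = split_sentences(suspicious_text)
    # One shared corpus built from the source/suspicious pair, as in the abstract.
    vectors = tfidf_vectors(src + susp)
    src_vecs, susp_vecs = vectors[: len(src)], vectors[len(src):]
    matrix = [[cosine(s, q) for q in susp_vecs] for s in src_vecs]
    matches = [
        (i, j, matrix[i][j])
        for i in range(len(src_vecs))
        for j in range(len(susp_vecs))
        if matrix[i][j] > threshold
    ]
    return matrix, matches
```

Calling `detect(source_text, suspicious_text)` returns the full matrix and a list of `(source_index, suspicious_index, similarity)` triples. Rows index source sentences and columns index suspicious sentences. The reported granularity of 3.85 suggests that a single plagiarism case is often split into several detections, so merging adjacent matched sentences into one segment would be a natural extension of this sketch.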
Similar resources
External Plagiarism Detection based on Human Behaviors in Producing Paraphrases of Sentences in English and Persian Languages
With the advent of the internet and easy access to digital libraries, plagiarism has become a major issue. One plagiarism detection technique uses search engines, converting plagiarism patterns into search queries. Generating suitable queries is the heart of this technique, and existing methods fall short in producing accurate queries and in the precision and speed of retrieved result...
A Deep Learning Approach to Persian Plagiarism Detection
Plagiarism detection is defined as automatic identification of reused text materials. General availability of the internet and easy access to textual information enhances the need for automated plagiarism detection. In this regard, different algorithms have been proposed to perform the task of plagiarism detection in text documents. Due to drawbacks and inefficiency of traditional methods and l...
Developing Bilingual Plagiarism Detection Corpus Using Sentence Aligned Parallel Corpus: Notebook for PAN at CLEF 2015
Plagiarism detection is the process of locating text reuse within a suspicious document. Plagiarism detection corpora are used for evaluating plagiarism detection systems. In this paper, we present a bilingual Persian-English plagiarism detection corpus. We provide our corpus for the task of text alignment corpus construction in the PAN 2015 competition. Our approach is based on parallel cor...
Developing Monolingual Persian Corpus for Extrinsic Plagiarism Detection Using Artificial Obfuscation: Notebook for PAN at CLEF 2015
The task of text alignment corpus construction at the PAN 2015 competition consists of preparing a plagiarism corpus that provides various obfuscation types and varying degrees of obfuscation. Meanwhile, its format and metadata structure should follow previous PAN plagiarism corpora. In this paper, we describe our approach to the construction of a monolingual Persian plagiarism corpus that can...
Design a Persian Automated Plagiarism Detector (AMZPPD)
Many plagiarism detection approaches currently exist, but few of them have been implemented and adapted for the Persian language. In this paper, we describe our work on designing and implementing a plagiarism detection system based on preprocessing and NLP techniques, and present the results of testing it on a corpus. Keywords: External Plagiarism, Plagiarism, Copy detection, natural ...